Abstract

This paper proposes a new approach to multi-agent systems that
leverages recent advances in networking and reinforcement learning
to scale up teamwork based on joint intentions. In this approach,
teamwork is subsumed by the coordination of learning agents. The
intuition is that successful coordination at the global level
generates opportunities for teamwork interactions at the local
level, and vice versa. The approach thus scales up model-based
teamwork theory with an adaptive approach to coordination.


Introduction 

Open environments such as Peer-to-Peer (P2P) networks and wireless
or Mobile Ad Hoc Networks (MANET) pose new challenges to
communication-based coordination algorithms such as joint
intentions [8], and also offer the opportunity to scale up. Our
framework is based on the proxy architecture of Machinetta [9], in
which proxy agents perform the domain-independent coordination task
on behalf of real, domain-dependent agents. We extend this framework
with a mechanism, based on reinforcement learning, for coordinating
individual actions. The resulting adaptive proxy agent architecture
is illustrated in Figure 1. In this approach, local teamwork
outcomes provide the feedback for learning the coordination task on
a larger scale. We first present the teamwork theory of joint
intentions and its associated problems in open environments, and
then our tentative approach, OpenMAS, illustrated with the fire
fighting example of the RoboCup Rescue competition [7].

Joint Intentions and Open Environments

Joint intentions [3, 8] form the cornerstone of the teamwork theory
of BDI (Belief, Desire, Intention) agents.


Figure 1. Adaptive proxy agent architecture

Joint intentions theory differentiates joint actions from individual
actions by the presence of a common mental state (beliefs) and a
joint commitment to achieving a goal. It relies on the communication
of critical information among team members. Open environments are
characterized by their dynamic nature, the heterogeneity of the
agents, and asynchronous, unreliable communication on a large scale.
The resulting problems can be categorized as follows: team
formation, role allocation, synchronization of beliefs, and
communication tradeoffs.

1. Team Formation. An open environment gives the opportunity to
find teammates appropriate for a task instead of relying on a fixed
group of agents. What is the best way to find teammates? When is the
best time to find teammates? In open environments, peers form
“groups” by similarity of individual interests. Likewise, similarity
of individual intentions is a necessary stepping stone for team
formation in open environments. An intention is defined here [3] as
the decision to do something in order to achieve a goal and can be
construed as a partial plan.

2. Role Allocation. While direct point-to-point communication with
an arbitrary node can be expensive and uncertain, access to
neighbors is readily available in open environments. P2P middleware,
such as JXTA (Juxtapose) [1], provides the functionality needed to
communicate reliably and cheaply with neighbors. In a MANET, the
possibility of disconnecting the network is another constraint on
accepting a role that requires a change in location. Figure 2
describes the connection role that peers play in communication in a
MANET. In open environments, multiple teams are involved. How should
the connectivity role of the agents be adjusted so that each team
can accomplish its goals most effectively?

3. Synchronization of Beliefs. The theory of joint commitments is
based on the ability to synchronize beliefs regarding “who is doing
what”. Teamwork breaks down when roles do not match expected
beliefs. How can agents adjust gracefully to delays in
communication?

4. Communication Selectivity. The tradeoff is between the
robustness that redundant messages can provide in open environments
and the cost of communication to the network. When reliable
communication cannot be assumed, selective communication of critical
information might be detrimental to the coordination task.
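
The role that similarity of intentions plays in team formation can be made concrete with a small sketch. The paper does not define a similarity measure, so the following assumes, purely for illustration, that an intention (a partial plan) is a collection of step labels and that similarity is the Jaccard overlap of those steps. The names `intention_similarity`, `candidate_teammates` and the `threshold` parameter are hypothetical.

```python
def intention_similarity(plan_a, plan_b):
    """Jaccard similarity between two partial plans given as collections of steps.

    Assumption: a partial plan is represented by a set of hashable step labels.
    """
    a, b = set(plan_a), set(plan_b)
    if not a and not b:
        return 1.0  # two empty plans are treated as identical
    return len(a & b) / len(a | b)


def candidate_teammates(own_plan, peer_plans, threshold=0.5):
    """Return the peers whose intentions are similar enough to team up with.

    `peer_plans` maps a peer name to that peer's partial plan; the threshold
    is an illustrative tuning knob, not a value taken from the paper.
    """
    return sorted(
        peer
        for peer, plan in peer_plans.items()
        if intention_similarity(own_plan, plan) >= threshold
    )
```

With a measure of this kind, an agent would only consider neighbors above the threshold as potential teammates, which also limits how much information needs to be exchanged.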

Synchronization of beliefs and communication selectivity are
complicated by open environments, while team formation and role
allocation are the problems we are interested in addressing given
these complicating factors.


Figure  2.  Multi-hop  routing  in  a  MANET 
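
The connectivity constraint on role allocation in a MANET can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the paper, that agents have 2D positions and that two agents share a link whenever they are within a common radio range; multi-hop reachability is then checked with a breadth-first search. The function names and the `radio_range` parameter are hypothetical.

```python
import math
from collections import deque


def is_connected(positions, radio_range):
    """True if every node can reach every other node over multi-hop links."""
    nodes = list(positions)
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for other in nodes:
            in_range = math.dist(positions[current], positions[other]) <= radio_range
            if other not in seen and in_range:
                seen.add(other)
                queue.append(other)
    return len(seen) == len(nodes)


def can_accept_move(positions, agent, new_position, radio_range):
    """Accept a role requiring a move only if the network stays connected."""
    moved = dict(positions)  # copy, so the caller's map is left untouched
    moved[agent] = new_position
    return is_connected(moved, radio_range)
```

In this reading, an agent acting as a relay between two others would decline a role whose target location breaks the only path between them.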


OpenMAS  Approach 

Our approach leverages the belief framework of cognitive agents at
the local level while endowing the agents with the adaptive
capabilities of reinforcement learners as an additional coordination
mechanism at the global level, where communication is uncertain and
unreliable. The overarching issues addressed are (1) how to
integrate general models of cooperation with reinforcement learning
in distributed, open environments, (2) what good metrics are for the
propagation of beliefs to heterogeneous agents, and (3) how to
integrate multiple goals.

Methodology 

Through the propagation of beliefs, the agents have some knowledge
of the global situation, albeit imperfect and decaying with time.
This capability mitigates the violation of the Markov property that
affects multi-agent reinforcement learning systems. Instead of
committing to a non-local role, the agents commit only to the next
individual step. This least-commitment approach addresses the
problems of teamwork in open environments outlined above. Local
environmental beliefs, on the other hand, trigger a role allocation
mechanism¹ among neighbors sharing the same beliefs. The joint
actions generated take precedence over the individual actions
generated by the coordination learning mechanism. Similarities
between joint actions and individual actions produce the terminal
rewards needed by the learning algorithm. There is thus a tight
integration between the local level of teamwork and the global level
of coordination. The overall approach is described in Algorithm 1.
Figure 3 illustrates the approach in the fire fighting example.


Algorithm 1 Intention/action loop

INPUT: intentions
OPENMAS-interpreter:
do
    <information, intention> ← receive-information()
    if similar-intentions(intention)
        accept-information()
        update-current-state()
    forget-and-predict()
    takeNextStep()
    propagate <next step, intentions> tuples
forever

The information received includes information from peers and/or
information perceived from the environment.
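
One pass through the body of Algorithm 1 might be sketched as follows. This is an illustrative reading, not the authors' implementation: the `Message` and `Agent` classes, the exact-match similarity test, the `decay` factor and the trivial next-step rule are all assumptions introduced here, and the prediction half of forget-and-predict is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    information: dict   # fact -> value, e.g. {"fire": (3, 4)}
    intention: str      # sender's intention, treated as an opaque label


@dataclass
class Agent:
    intentions: set
    beliefs: dict = field(default_factory=dict)   # fact -> (value, confidence)
    decay: float = 0.5                            # hypothetical forgetting factor

    def similar_intentions(self, intention):
        # Placeholder similarity test: exact match against the agent's own intentions.
        return intention in self.intentions

    def forget_and_predict(self):
        # Forgetting only: every belief loses confidence on each cycle.
        self.beliefs = {k: (v, c * self.decay) for k, (v, c) in self.beliefs.items()}

    def take_next_step(self):
        # Least commitment: decide only the next individual step.
        if not self.beliefs:
            return "explore"
        fact = max(self.beliefs, key=lambda k: self.beliefs[k][1])
        return f"move_towards:{fact}"

    def cycle(self, message):
        """Run one iteration of the loop body and return the tuple to propagate."""
        if self.similar_intentions(message.intention):
            for fact, value in message.information.items():
                self.beliefs[fact] = (value, 1.0)   # accept and update current state
        self.forget_and_predict()
        next_step = self.take_next_step()
        return (next_step, frozenset(self.intentions))
```

Information tagged with a dissimilar intention is simply ignored, which is how the loop filters traffic without any explicit team membership.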


¹ Role allocation of mutually exclusive tasks among agents can be
modelled with a distributed resource allocation algorithm similar to
the drinking philosophers problem.

The environment of agents acting under uncertainty can be
conveniently modelled as a POMDP (Partially Observable Markov
Decision Process). A POMDP can be reformulated as a continuous-space
Markov decision process (MDP) over belief states [6] and solved
using an approximation technique. When local environmental beliefs
are propagated, the redundancy of messages reinforces the current
state beliefs, which otherwise decay with time. Propagated location
information is updated through the same prediction mechanism used to
select the next action of the agents. Forgetting and prediction are
the two tools enabling the synchronization of beliefs through
asynchronous and unreliable communication. The most likely state of
the global situation is then modeled as an MDP, and the action to
take is determined by a stochastic policy approximated by a policy
gradient method [11].

Figure 3. Fire fighting example. The agents propagate changes of
position and changes in the fires' status to their neighbors
recursively according to a time-to-live (TTL) parameter. Role
allocation strategies resolve local conflicts.
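
The propagation of beliefs under a TTL, together with forgetting and reinforcement by redundant messages, can be sketched as below. The paper gives no formulas for these mechanisms, so the exponential decay with a `half_life`, the noisy-OR style reinforcement with a `weight`, and all function names are assumptions made for illustration only.

```python
from collections import deque


def propagate(neighbors, source, ttl):
    """Flood a belief from `source`; return {node: hops} for nodes within `ttl` hops.

    `neighbors` maps each node to the list of nodes it can reach directly.
    """
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if hops[node] == ttl:
            continue  # TTL exhausted: this node does not forward further
        for peer in neighbors[node]:
            if peer not in hops:
                hops[peer] = hops[node] + 1
                queue.append(peer)
    return hops


def decay(confidence, elapsed, half_life=5.0):
    """Forgetting: confidence halves every `half_life` time steps (assumed form)."""
    return confidence * 0.5 ** (elapsed / half_life)


def reinforce(confidence, weight=0.5):
    """Redundancy: each extra message closes part of the gap to full confidence."""
    return 1.0 - (1.0 - confidence) * (1.0 - weight)
```

A larger TTL spreads a belief further at a higher network cost, while the decay rate determines how quickly stale beliefs stop influencing the choice of the next step.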

In addition to searching for and fighting fires, the fire trucks
(the agents) have the additional task of maintaining the
connectivity of the network. These sometimes conflicting goals must
be balanced. The synergy of the two goals should maintain a proper
degree of dispersion among the agents. In this context, multiple
MDPs model the different intentions of the agents. One MDP
represents the belief map of the agents' locations, while another
MDP represents the belief map of the locations of the fires. The
action to take is the best action [5] across those MDPs after a
period of exploration.
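
The text does not spell out how the best action across the MDPs is chosen. One simple possibility, assumed here for illustration and not necessarily the method of [5], is to add up each action's estimated value over all MDPs and pick the action with the largest total. The function name and the example values are hypothetical.

```python
def best_action(action_values_per_mdp):
    """Pick the action with the highest summed value across several MDPs.

    `action_values_per_mdp` is a list of dicts, one per MDP, each mapping an
    action to its estimated value in the current belief state.
    """
    totals = {}
    for values in action_values_per_mdp:
        for action, value in values.items():
            totals[action] = totals.get(action, 0.0) + value
    return max(totals, key=totals.get)


# Example: the fire MDP favors moving north, the connectivity MDP penalizes it.
fire_values = {"north": 0.9, "south": 0.1, "stay": 0.2}
connectivity_values = {"north": -0.8, "south": 0.3, "stay": 0.5}
chosen = best_action([fire_values, connectivity_values])
```

Under this rule an action that is attractive for one goal but harmful for the other can lose to a compromise action, which is one way the dispersion of the agents could be kept in check.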


Problem  Modeling 

The world is modeled as the problem space:

$W = \{S, S', A, T, R\}$

where

• $S$ is the believed perceived local state of the world.

• $S'$ is the believed global state of the world through
propagation of information.

• $A$ is the set of actions.

• $T$ is the set of transition probabilities
$S \times A \times S \to [0, 1]$.

• $R$ is the set of roles,

and

$S_t \times R \to A_j$
$S'_t \times A \to A_i$

where

• $A_j$ is the action determined to achieve role $R$.

• $A_i$ is the action determined by coordination in the believed
state space $S'$.

A reward is obtained if $A_j = A_i$.
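
The reward rule can be combined with the policy gradient learning mentioned in the Methodology section in a small sketch. It uses the textbook REINFORCE update for a softmax policy over a fixed action set, with a terminal reward of 1 when the individually chosen action matches the role-determined joint action. Ignoring the state, the learning rate `lr`, and the function names are all simplifying assumptions, not details from the paper.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs.values())
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def reinforce_step(prefs, joint_action, rng, lr=0.5):
    """Sample an individual action, reward a match, and update preferences in place."""
    probs = softmax(prefs)
    actions = list(probs)
    individual = rng.choices(actions, weights=[probs[a] for a in actions])[0]
    reward = 1.0 if individual == joint_action else 0.0
    # For a softmax policy, d log pi(individual) / d prefs[b] = 1[b == individual] - pi(b).
    for b in actions:
        grad = (1.0 if b == individual else 0.0) - probs[b]
        prefs[b] += lr * reward * grad
    return individual, reward
```

Repeated over many steps, the preference for the action that local teamwork keeps selecting grows, so the individually learned policy drifts towards the joint action without any explicit commitment to a role.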

Related  Work 

The dissemination of information enables agents to obtain some
global, though imperfect, knowledge of the world. This capability
has been exploited to scale up communication-based teamwork
approaches, and our approach also exploits it to enhance multi-agent
learning. Our approach differs from the large-scale coordination of
Machinetta proxies [10] because (1) the uncertainty due to delays in
communication is taken into account and (2) individual actions lead
to joint actions through online adaptation.

Our approach is also related to approaches for learning plan
competencies in BDI multi-agent systems [4], where plan successes or
failures trigger explanation-based learning to modify the plan. Our
approach, however, assumes a correct and complete plan library, so
the success of the task depends only on the coordination task.


Conclusions 

Open environments such as P2P and MANET force a reexamination of
teamwork in large-scale systems, relying more on adaptive
coordination than on explicit cooperation requiring synchronization
points. The capability to acquire global, albeit imperfect,
knowledge through the propagation of information makes it possible
to use independent reinforcement learners for coordination tasks in
multi-agent systems. Similarity of intentions can help relieve the
burden placed on the network by selectively propagating information.
A local teamwork model drives the rewards of the overall
coordination task. This approach scales well to any dimension, and
its precision can be modulated by the TTL parameter. In future work,
this approach will be compared quantitatively with centralized and
omniscient algorithms and under variations in network reliability.